Introduction

In this R Notebooks, I will explore the distribution of Zipcodes accross DMAs. When we map the DMAs based on the zipcodes, we get that a given DMA is widespread accross the US as shown in the map below. This is noise in the data. But what percentage is noise?

dma_zip

dma_zip


Hypothesis

For a given zipcode, the DMA that it most frequently is associated with is the correct zipcode-DMA association. All the rest are outliers. Therefore, we will calculate the percentage of occurance of a given zipcode accross different DMAs to detect the outliers.


library(data.table)
library(DT)
library(plotly)
library(tidyverse)
df <- fread("Documents/gannet/clustering/EDA/traffic_by_host_status_dma.csv")
df = df %>% select(geo_dma, geo_zip, count)

Group by geo_zip and geo_dma

The table below shows sample of records of geo_zip and geo_dma with their count. These are the most frequent ones.

dma_zip = df %>% group_by(geo_dma, geo_zip) %>% summarise(count_dma_zip = sum(count)) %>% arrange(desc(count_dma_zip))
datatable(dma_zip %>% head())


Group by zipcode

The table below shows the total instances by zipcode. It will help us to analyze what percentage of records of a given zipcode is found in a given DMA.

zip = df %>% group_by(geo_zip) %>% summarise(count_zip = sum(count)) %>% arrange(desc(count_zip))
datatable(zip %>% head())


Calculate percentage

Now, let’s calculate percentage of a given zipcode records across a DMA.

joined = dma_zip %>% inner_join(zip, by = 'geo_zip')
joined  = joined %>% mutate(percentage = count_dma_zip/count_zip * 100) #%>% arrange(desc(percentage))
data.table(joined)


Distribution of the percentages

Let’s see the distibution of the percentages.

g1 = joined %>% ggplot(aes(x = percentage)) + geom_histogram()
ggplotly(g1)

We see two peaks near 0 and 100. Let’s use log scale to further understand the distribution.

g2 = joined %>% ggplot(aes(x = log10(percentage))) + geom_histogram()
ggplotly(g2)


Remarks:


How many zipcodes have maximum percentage more than 95%?

joined %>% filter(percentage > 95) %>% distinct(geo_zip) %>% nrow()
[1] 32793


How many unique zipcodes do we have in the data?

df %>% select(geo_zip) %>% distinct() %>% nrow()
[1] 33089

We see that 296 zipcodes (33089-32793), which is about 0.9%, have less than 95% maximum percentage of occrance in a given DMA.


Let’s see their distibution.

z1 = df %>% select(geo_zip) %>% dplyr::distinct() 
z2 = joined %>% filter(percentage > 95) %>% dplyr::distinct(geo_zip)
less_95 = z1 %>% anti_join(z2,by = "geo_zip")

g2 = joined %>% semi_join(less_95) %>% dplyr::group_by(geo_zip) %>% summarise(percentage = max(percentage)) %>%
  ggplot(aes(x = percentage)) + geom_histogram()
ggplotly(g2)


How many records are outliers

z1 = joined %>% dplyr::group_by(geo_zip) %>% summarise(percentage = max(percentage))

outliers = joined  %>% anti_join(z1, by = c("geo_zip", "percentage")) %>% select(geo_dma, geo_zip, count_dma_zip, percentage)
cat('There are ', sum(outliers$count_dma_zip), 'outlier zipcodes out of a total of', sum(df$count), 'records')
There are  1123285 outlier zipcodes out of a total of 451653467 records
cat('This is', sum(outliers$count_dma_zip)/sum(df$count)*100,  'percent')
This is 0.2487051 percent


Summary

So, there are less than 0.25% of zipcodes outside of their DMA (outliers) assuming that those with maximum percentages for each geo-zip and geo-dma combination are correct and after filtering them out we get the map below. dma_zip_clean